Intent and Category Classification

Name: Vinaya Mahesh Chinti

Given the dataset, the challenge is to build a machine learning model that will classify text input into to these categories and intents (including OOD if it is none of the other intents) and try to measure how well your model is doing.

Reading and Analyzing Data

Checking if the data is balanced

The analysis shows that each of the categories except the OOD category has 2250 records. The OOD category has 1350 records Each category has specific intents and there is a one-to-one mapping between the intent and the category. Each category has 15 intents and the category OOD has one intent which is OOD itself. Each category has 150 records for each intent. From this, we can conclude that the data is balanced.

Data Preprocessing and Word Token Analysis

In the below blocks of code, the data is being pre-processed. The text column is used for data preprocessing. Following steps are applied for preprocessing 1) The text is converted to lower case.

2) Digits and special symbols are removed from the text.

3) The text is pre-processed to remove the stop words.

4) A new column named clean_data is created to store the pre-processed data.


Word Cloud is used the analyze the text for each category. From the visualized word cloud we can conclude the following things: 1) In the banking category the common words used are related to checking the amount, routing numbers, bills, payments, and interests.

2) In the credit_card category the words are related to the credit card limits, rewards, credit score, and queries regarding card lost or declined.

3) The small_talk category involved questions related to the meaning of life, pets, facts, and humans.

4) The OOD category includes everything which is not related to any of the above-mentioned categories and tends to have things related to cars and phones.

Displaying top 20 words used in the text column

The following block of code is used to analyze the top words that are encountered in the text. It gives us idea about what mostly the people tend to ask. From the graph it is observed that top 20 words are related to the banking and credit_card .

Splitting the data into train and test Datasets

Functions used for Model Training and Prediction

I have formulated 2 solutions for the classification problem. the dataset has intent and categories and the model needs to predict both the columns. I have used Multinomial Naive Bayes and Support Vector Classifier for classification. The performance of the Multinomial Naive Bayes classifier and Support Vector Classifier is compared using accuracy, classification report, and the processing time required for training the models.



The first solution uses a stacked approach which consist of the following steps: 1) Developing a Multinomial Naive Bayes classifier which is trained for predicting the category

2) Filtering the data for each category and then splitting it into train and test datasets

3) Developing Multinomial Naive Bayes classifier for each of the category types which will be used for predicting the intent

4) Using the model to predict the category and then based on the category predicted using the category-specific model to predict the intent for the category

Similar steps are followed by using the Support Vector Classifier as well.



The second solution uses a single model approach which consist of the following steps:

1) Creating a dictionary consisting of the mapping between the category and its intents

2) Developing a Multinomial Naive Bayes for predicting the intent directly on the full dataset

3) Iterating through the dictionary to get the category for the predicted intent

Similar steps are followed by using the Support Vector Classifier as well.

Model Training and Prediction: Stacked Approach

Using the Trained models for predicting the category and intent on the user input

Multinomial Naive Bayes and Support Vector Classifier is used for model training. The Support Vector Classifier achieved greater accuracy during cross-validation and on the test data as well when compared to the Multinomial Naive Bayes Classifier. This is evident for the category classification as well as the category-specific intent classification. The precision has improved significantly with the Support Vector Classifier. Therefore for prediction on the user input, I have used Support Vector Classifier.

Model Training and Prediction: Single Model Approach

Predicting the Intent using the Single Model Approach

For the intent prediction as well the Support Vector Classifier has performed well as compared to the multinomial Naive Bayes algorithm. So for prediction, the SVC model has been used. The intent is predicted using the model and then the dictionary is iterated to find the category mapped to the intent.

Conclusion

When the stacking and the dictionary approach are compared it is observed that the stacking approach has performed better as it has greater accuracy. Moreover, using the second approach it is difficult for the model to generalize over 46 classes. Considering the processing time, the stacking approach performed much better. Although processing time is not a concern as it can be improved by using cloud platforms for training the models. Considering the processing time and the performance metrics I would recommend using Support Vector Machine with the stacking approach.